In this project we analyse a data set of a company's customers using unsupervised learning tools. Because of the large amount of data, we will first do some data cleaning and feature analysis; then PCA, which we will use to summarize the data; and finally factor analysis and clustering, to see who these clients are and whether we can group them by their characteristics.
Our main goal is to see whether the churn variable (which indicates if a customer left the telecom company or not) has any relationship with our analysis: can we group the customers without using the churn variable, and do the resulting groups correlate with it?
First we load the data. The data set was downloaded from a Kaggle post about a telecom company. Its original purpose was to predict the churn variable, but we will instead try to segment the clients into groups and see whether those groups correspond to that variable.
# Libraries used in this section (assumed loaded here: tidyverse provides
# glimpse/select_if/is_empty, GGally provides ggcorr, and factoextra
# provides the fviz_* helpers)
library(tidyverse)
library(GGally)
library(factoextra)
customers.df = read.csv("dataset.csv", stringsAsFactors = TRUE)
glimpse(customers.df)
## Rows: 100,000
## Columns: 100
## $ rev_Mean <dbl> 23.9975, 57.4925, 16.9900, 38.0000, 55.2300, 82.2750,…
## $ mou_Mean <dbl> 219.25, 482.75, 10.25, 7.50, 570.50, 1312.25, 0.00, 6…
## $ totmrc_Mean <dbl> 22.500, 37.425, 16.990, 38.000, 71.980, 75.000, 16.99…
## $ da_Mean <dbl> 0.2475, 0.2475, 0.0000, 0.0000, 0.0000, 1.2375, 0.000…
## $ ovrmou_Mean <dbl> 0.00, 22.75, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 419.…
## $ ovrrev_Mean <dbl> 0.0000, 9.1000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ vceovr_Mean <dbl> 0.0000, 9.1000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ datovr_Mean <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
## $ roam_Mean <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ change_mou <dbl> -157.25, 532.25, -4.25, -1.50, 38.50, 156.75, 0.00, 1…
## $ change_rev <dbl> -18.9975, 50.9875, 0.0000, 0.0000, 0.0000, 8.1450, -0…
## $ drop_vce_Mean <dbl> 0.6666667, 8.3333333, 0.3333333, 0.0000000, 9.6666667…
## $ drop_dat_Mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ blck_vce_Mean <dbl> 0.6666667, 1.0000000, 0.0000000, 0.0000000, 0.6666667…
## $ blck_dat_Mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ unan_vce_Mean <dbl> 6.3333333, 61.3333333, 2.6666667, 0.0000000, 77.00000…
## $ unan_dat_Mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ plcd_vce_Mean <dbl> 52.333333, 263.333333, 9.000000, 3.666667, 222.333333…
## $ plcd_dat_Mean <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ recv_vce_Mean <dbl> 42.3333333, 69.0000000, 0.3333333, 1.3333333, 94.6666…
## $ recv_sms_Mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ comp_vce_Mean <dbl> 45.000000, 193.333333, 6.000000, 3.666667, 137.000000…
## $ comp_dat_Mean <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ custcare_Mean <dbl> 0.0000000, 1.6666667, 0.0000000, 0.0000000, 8.6666667…
## $ ccrndmou_Mean <dbl> 0.0000000, 6.3333333, 0.0000000, 0.0000000, 15.000000…
## $ cc_mou_Mean <dbl> 0.0000000, 5.4633333, 0.0000000, 0.0000000, 11.076666…
## $ inonemin_Mean <dbl> 18.0000000, 53.0000000, 0.3333333, 1.3333333, 66.0000…
## $ threeway_Mean <dbl> 0.0000000, 0.3333333, 0.0000000, 0.0000000, 0.0000000…
## $ mou_cvce_Mean <dbl> 90.643333, 189.396667, 5.426667, 8.410000, 285.233333…
## $ mou_cdat_Mean <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.000…
## $ mou_rvce_Mean <dbl> 97.1766667, 55.2800000, 0.0000000, 0.4133333, 106.330…
## $ owylis_vce_Mean <dbl> 0.0000000, 46.3333333, 0.0000000, 0.3333333, 14.66666…
## $ mouowylisv_Mean <dbl> 0.00000000, 24.21666667, 0.00000000, 0.25666667, 10.8…
## $ iwylis_vce_Mean <dbl> 0.0000000, 6.3333333, 0.0000000, 0.0000000, 0.6666667…
## $ mouiwylisv_Mean <dbl> 0.00000000, 3.69666667, 0.00000000, 0.00000000, 0.366…
## $ peak_vce_Mean <dbl> 58.0000000, 83.6666667, 5.0000000, 1.3333333, 97.3333…
## $ peak_dat_Mean <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ mou_peav_Mean <dbl> 132.600000, 75.333333, 5.193333, 3.380000, 173.476667…
## $ mou_pead_Mean <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.000…
## $ opk_vce_Mean <dbl> 24.0000000, 157.0000000, 1.0000000, 3.6666667, 90.333…
## $ opk_dat_Mean <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ mou_opkv_Mean <dbl> 55.2200000, 169.3433333, 0.2333333, 5.4500000, 218.08…
## $ mou_opkd_Mean <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ drop_blk_Mean <dbl> 1.3333333, 9.3333333, 0.3333333, 0.0000000, 10.333333…
## $ attempt_Mean <dbl> 52.333333, 263.333333, 9.000000, 3.666667, 222.333333…
## $ complete_Mean <dbl> 45.000000, 193.333333, 6.000000, 3.666667, 137.000000…
## $ callfwdv_Mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ callwait_Mean <dbl> 0.3333333, 5.6666667, 0.0000000, 0.0000000, 0.0000000…
## $ churn <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ months <int> 61, 56, 58, 60, 57, 59, 53, 53, 55, 57, 59, 53, 55, 5…
## $ uniqsubs <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 1, 2, 5, 2, 1, 3,…
## $ actvsubs <int> 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2,…
## $ new_cell <fct> U, N, Y, Y, Y, Y, N, Y, Y, N, Y, Y, Y, Y, Y, N, N, N,…
## $ crclscod <fct> A, EA, C, B, A, C, A, B, B, A, A, A, BA, A, C, B, B, …
## $ asl_flag <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,…
## $ totcalls <int> 1652, 14654, 7903, 1502, 4485, 26812, 6279, 4491, 167…
## $ totmou <dbl> 4228.000, 26400.000, 24385.053, 3065.000, 14028.000, …
## $ totrev <dbl> 1504.62, 2851.68, 2155.91, 2000.90, 2181.12, 4033.98,…
## $ adjrev <dbl> 1453.44, 2833.88, 1934.47, 1941.81, 2166.48, 3932.90,…
## $ adjmou <dbl> 4085.00, 26367.00, 24303.05, 3035.00, 13965.00, 40295…
## $ adjqty <int> 1602, 14624, 7888, 1479, 4452, 26362, 6271, 4470, 165…
## $ avgrev <dbl> 29.66, 51.53, 34.54, 40.45, 38.69, 83.68, 58.95, 34.7…
## $ avgmou <dbl> 83.37, 479.40, 433.98, 63.23, 249.38, 857.34, 334.06,…
## $ avgqty <dbl> 32.69, 265.89, 140.86, 30.81, 79.50, 560.89, 120.60, …
## $ avg3mou <int> 272, 305, 12, 8, 558, 1260, 0, 633, 973, 6, 90, 18, 1…
## $ avg3qty <int> 116, 158, 7, 3, 191, 960, 0, 96, 465, 3, 16, 13, 625,…
## $ avg3rev <int> 30, 40, 17, 38, 55, 80, 17, 39, 90, 30, 60, 36, 80, 2…
## $ avg6mou <int> 322, 477, 11, 50, 586, 1187, 0, 719, 915, 54, 123, 36…
## $ avg6qty <int> 136, 275, 6, 25, 196, 853, 0, 112, 434, 7, 32, 26, 62…
## $ avg6rev <int> 38, 48, 17, 40, 80, 78, 17, 58, 90, 34, 64, 36, 76, 2…
## $ prizm_social_one <fct> S, U, S, T, U, U, C, , S, C, C, C, U, C, S, S, T, U, …
## $ area <fct> NORTHWEST/ROCKY MOUNTAIN AREA, CHICAGO AREA, GREAT LA…
## $ dualband <fct> Y, N, N, N, Y, N, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, N,…
## $ refurb_new <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,…
## $ hnd_price <dbl> 149.98999, NA, 29.98999, 29.98999, 149.98999, 129.989…
## $ phones <int> 2, 7, 2, 1, 6, 9, 4, 3, 3, 2, 3, 4, 9, 2, 10, 5, 3, 6…
## $ models <int> 2, 6, 1, 1, 4, 4, 3, 2, 3, 2, 3, 3, 5, 2, 6, 4, 2, 5,…
## $ hnd_webcap <fct> WCMB, WC, NA, NA, WCMB, WCMB, NA, WCMB, NA, WCMB, WCM…
## $ truck <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ rv <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ownrent <fct> O, , O, , R, O, O, R, , O, O, O, O, , O, O, O, O, O, …
## $ lor <int> 15, 1, 7, 6, 5, 1, 8, 0, 3, 8, 15, 4, 15, 1, 11, 4, 3…
## $ dwlltype <fct> S, S, S, M, M, , S, S, M, S, S, S, S, S, S, M, S, S, …
## $ marital <fct> S, S, M, M, S, S, M, M, M, M, S, A, S, U, S, U, U, S,…
## $ adults <int> 1, 1, 2, 4, 1, 1, 3, 2, 3, 2, 5, 3, 3, 1, 3, 3, 2, 5,…
## $ infobase <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…
## $ income <int> 4, 5, 5, 6, 6, 6, 9, 1, 4, 9, 6, 9, 5, 7, 3, 1, 3, 4,…
## $ numbcars <int> 3, 1, 2, 1, 1, 1, NA, 2, 1, 2, NA, NA, NA, NA, 2, NA,…
## $ HHstatin <fct> C, C, C, C, C, C, I, C, C, I, C, , B, A, I, , C, C, C…
## $ dwllsize <fct> A, A, A, D, O, , A, A, E, A, A, A, A, A, A, , A, A, A…
## $ forgntvl <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ethnic <fct> N, Z, N, U, I, U, N, S, F, N, N, N, J, N, N, N, N, N,…
## $ kid0_2 <fct> U, U, U, Y, U, U, U, U, U, U, Y, U, U, U, U, U, U, U,…
## $ kid3_5 <fct> U, U, Y, U, U, U, U, U, U, U, U, U, U, U, U, Y, U, U,…
## $ kid6_10 <fct> U, U, U, U, U, U, U, U, U, U, U, Y, U, U, Y, U, U, U,…
## $ kid11_15 <fct> U, U, U, U, U, U, U, U, U, U, U, Y, U, U, U, U, U, U,…
## $ kid16_17 <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, Y,…
## $ creditcd <fct> Y, Y, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, Y, Y, Y, Y,…
## $ eqpdays <int> 361, 240, 1504, 1812, 434, 458, 852, 231, 700, 601, 4…
## $ Customer_ID <int> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006,…
At first sight, we can see a lot of factor variables and a large number of NAs.
First, we will take care of the missing values.
hist(rowMeans(is.na(customers.df)), xlab = c("Missing values average by rows"), main = c())
We can see that some rows have a lot of empty values; that is, there are some customers for whom we do not have much information.
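For example, on a toy data frame (the 0.5 cutoff and the toy data are illustrative assumptions, not part of the analysis), flagging heavily incomplete rows could look like this:

```r
# Toy illustration: flag rows where more than half of the fields are
# missing (the 0.5 cutoff is arbitrary)
toy = data.frame(a = c(1, NA, NA), b = c(2, NA, 5), c = c(3, NA, NA))
heavyNA = which(rowMeans(is.na(toy)) > 0.5)
length(heavyNA)  # rows 2 and 3 are flagged
```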
Then we will look at the missing values by columns.
indexesEmptyCols = which(colMeans(is.na(customers.df)) != 0)
colsWithNA = sort(colMeans(is.na(customers.df[, indexesEmptyCols])), decreasing = TRUE)
barplot(colsWithNA, las=2)
Here we can clearly see that, of the 100 columns we have, five columns account for most of the missing values. So we will first remove those columns.
# Remove the five columns with the highest proportion of missing values
indexesColsToRemove = which(names(customers.df) %in% names(colsWithNA)[1:5])
customers.df = customers.df[, -indexesColsToRemove]
Now that the columns are almost cleaned we will look at all the rows that have missing values.
indexesEmptyRows = which(rowMeans(is.na(customers.df))!= 0)
print(length(indexesEmptyRows))
## [1] 6069
There are about 6,000 rows with some missing value. As that is small compared to the total number of rows, we will remove them.
customers.df = customers.df[-indexesEmptyRows,]
Now, with columns and rows almost clean, let's check whether any explicit NA values remain; the empty levels of the factors will be dealt with next.
print(length(which(is.na(customers.df) == TRUE)))
## [1] 0
rm(list = setdiff(ls(), "customers.df"))
Now we are going to take care of the factor variables, because a lot of missing values are often hidden in them as empty-string levels. So let's get all the levels of each factor variable.
customers.df = droplevels.data.frame(customers.df)
colsWithEmptyFactors.df = data.frame(colName = character(), index = integer(), naNumber = integer(),
stringsAsFactors = FALSE)
for (i in 1:ncol(customers.df)) {
levels = levels(customers.df[, i])
if ("" %in% levels) {
colsWithEmptyFactors.df = rbind(colsWithEmptyFactors.df, data.frame(colName = names(customers.df)[i], index = i, naNumber = 0,
stringsAsFactors = FALSE))
}
}
For each factor column, we will count the number of missing values.
for (i in 1:nrow(colsWithEmptyFactors.df)) {
index = colsWithEmptyFactors.df$index[i]
level = levels(customers.df[, index])
level[which(level == "")] = NA
levels(customers.df[, index]) = level
colsWithEmptyFactors.df[i, 3] = length(which(is.na(customers.df[, index]) == TRUE))
}
print(colsWithEmptyFactors.df)
## colName index naNumber
## 1 prizm_social_one 71 6334
## 2 area 72 38
## 3 ownrent 80 30225
## 4 dwlltype 81 28518
## 5 infobase 83 19206
## 6 HHstatin 84 34122
## 7 dwllsize 85 34639
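The recoding used in the loop above can be seen on a toy factor: renaming an empty-string level to NA makes R drop that level and mark the corresponding entries as missing.

```r
# Toy illustration of the empty-level recoding: "" becomes a real NA
f = factor(c("A", "", "B", ""))
lev = levels(f)
lev[lev == ""] = NA
levels(f) = lev  # assigning NA to a level removes it and NA-s its entries
sum(is.na(f))    # the 2 hidden missing values are now visible
```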
Because all these factors except area and prizm_social_one have too many NAs, we will remove those columns.
customers.df = customers.df[, -colsWithEmptyFactors.df$index[3:nrow(colsWithEmptyFactors.df)]]
Now, we will get all the rows with missing values.
indexesEmptyRows = which(rowMeans(is.na(customers.df))!= 0)
length(indexesEmptyRows)
## [1] 6370
Because the number is relatively small, we will delete those rows.
customers.df = customers.df[-indexesEmptyRows,]
print(length(which(is.na(customers.df) == TRUE)))
## [1] 0
Now there are no more NAs, and we are left with a pretty big data set to work on. Finally, we look at duplicated clients.
length(which(duplicated(customers.df) == TRUE))
## [1] 0
And there are none.
After looking at the data, we have pretty much all the variables we need to work with, but there are some steps to take care of first:
REMOVE = "remove"
TO_FACTOR = "toFactor"
editColumn = function(colNames, df, toDo) {
for (col in colNames) {
index = which(names(df) == col)
if (!is_empty(index)) {
if (toDo == REMOVE) {
df = df[, -index]
} else if (toDo == TO_FACTOR) {
df[, index] = as.factor(df[, index])
} else { stop("Could not perform that action.") }
}
}
return(df)
}
customers.df = editColumn(c("forgntvl", "truck", "rv"), customers.df, TO_FACTOR)
customers.df = droplevels.data.frame(customers.df)
row.names(customers.df) = customers.df$Customer_ID
customers_original.df = customers.df
customers.df = editColumn(c("churn", "Customer_ID"), customers.df, REMOVE)
After removing the target variable and the customer ID, and transforming some columns into factors, we clean the environment.
rm(list = setdiff(ls(), c("customers.df", "customers_original.df")))
Let’s plot all the data set to look for outliers.
boxplot(customers.df, las=2, col="darkblue")
We can see that the variables are on very different scales, and that there are some outliers. So let's see if scaling the data helps.
customers_numeric.df = select_if(customers.df, is.numeric)
customers_not_numeric.df = select_if(customers.df, negate(is.numeric))
boxplot(scale(customers_numeric.df), las=2, col="darkblue")
We can see a little improvement, with fewer outliers. However, by scaling we lose the information in the variables' original scales, so let's look at the correlation between them.
R = cor(customers_numeric.df)
ggcorr(customers_numeric.df, cor_matrix = R, label = TRUE)
Because of the huge number of variables, the plot is only readable when zoomed in. But we can see that many variables are strongly correlated, so we will remove the redundant ones (one of each strongly correlated pair). This makes sense, because there are many near-duplicate variables, such as average monthly use over 3, 6, etc. months.
R[upper.tri(R)] = 0
diag(R) = 0
customers_numeric_no_corr.df = customers_numeric.df[, !apply(R, 2, function(x) any(abs(x) > 0.9, na.rm = TRUE))]
rm(R)
With this we can see that 22 variables were removed.
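A small replica of this pruning rule (on toy data, not the real data set) shows how it behaves: with the upper triangle and diagonal zeroed, each correlation sits under the earlier column of its pair, so that earlier column is the one removed.

```r
# Toy replica of the |r| > 0.9 pruning rule used above
set.seed(1)
x = rnorm(100)
toy = data.frame(a = x, b = x + rnorm(100, sd = 0.01), c = rnorm(100))
Rt = cor(toy)
Rt[upper.tri(Rt)] = 0
diag(Rt) = 0
kept = toy[, !apply(Rt, 2, function(col) any(abs(col) > 0.9, na.rm = TRUE))]
names(kept)  # "a" is dropped as a near-duplicate of "b"
```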
First of all, let's compute the PCA both scaled and unscaled, to see whether there is any difference in the results.
pca = prcomp(customers_numeric_no_corr.df)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 3871.0475 536.18650 531.94380 274.37625 231.89810
## Proportion of Variance 0.9509 0.01824 0.01796 0.00478 0.00341
## Cumulative Proportion 0.9509 0.96915 0.98711 0.99188 0.99530
## PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 154.90526 133.51307 110.74285 88.09443 53.09050 44.61102
## Proportion of Variance 0.00152 0.00113 0.00078 0.00049 0.00018 0.00013
## Cumulative Proportion 0.99682 0.99795 0.99873 0.99922 0.99940 0.99953
## PC12 PC13 PC14 PC15 PC16 PC17
## Standard deviation 37.65613 33.26910 30.81608 27.63984 23.80900 22.18829
## Proportion of Variance 0.00009 0.00007 0.00006 0.00005 0.00004 0.00003
## Cumulative Proportion 0.99962 0.99969 0.99975 0.99979 0.99983 0.99986
## PC18 PC19 PC20 PC21 PC22 PC23
## Standard deviation 21.09227 19.10587 16.31055 14.50608 14.08652 12.59224
## Proportion of Variance 0.00003 0.00002 0.00002 0.00001 0.00001 0.00001
## Cumulative Proportion 0.99989 0.99991 0.99993 0.99994 0.99996 0.99997
## PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 12.03064 9.67548 8.714 7.441 6.95 6.135 4.932 3.987
## Proportion of Variance 0.00001 0.00001 0.000 0.000 0.00 0.000 0.000 0.000
## Cumulative Proportion 0.99997 0.99998 1.000 1.000 1.00 1.000 1.000 1.000
## PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40
## Standard deviation 3.459 3.254 2.6 1.843 1.686 1.393 1.196 1.004 0.9752
## Proportion of Variance 0.000 0.000 0.0 0.000 0.000 0.000 0.000 0.000 0.0000
## Cumulative Proportion 1.000 1.000 1.0 1.000 1.000 1.000 1.000 1.000 1.0000
## PC41 PC42 PC43 PC44 PC45 PC46 PC47
## Standard deviation 0.9578 0.7785 0.7314 0.5075 0.471 0.3659 0.3372
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000 1.0000 1.0000 1.000 1.0000 1.0000
## PC48
## Standard deviation 1.347e-09
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
In the PCA without scaling, the first component alone explains about 95% of the variance.
fviz_screeplot(pca, addlabels = TRUE)
fviz_contrib(pca, choice = "var", axes = 1)
Without scaling, we clearly see that the only important variable is adjqty (adjusted total number of calls). That is not ideal for a model: with 48 variables, a single one should not carry all the importance.
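This domination by a single variable is exactly what happens when PCA is run on unscaled data where one column has a much larger scale; a toy example (the column names and scales are assumptions) reproduces the effect:

```r
# Unscaled PCA is dominated by whichever column has the largest scale;
# "big" stands in for a variable like adjqty
set.seed(42)
toy = data.frame(small1 = rnorm(500), small2 = rnorm(500),
                 big = rnorm(500, sd = 1000))
p = prcomp(toy)  # no scaling, as above
p$sdev[1]^2 / sum(p$sdev^2)  # PC1 carries essentially all the variance
```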
Now, let’s do the PCA scaling the data.
pca_scaled = prcomp(customers_numeric_no_corr.df, scale. = TRUE)
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.7682 1.87026 1.68027 1.41888 1.3926 1.33714 1.31562
## Proportion of Variance 0.2958 0.07287 0.05882 0.04194 0.0404 0.03725 0.03606
## Cumulative Proportion 0.2958 0.36868 0.42750 0.46945 0.5099 0.54710 0.58316
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.25567 1.17608 1.04803 1.00326 1.00004 0.98491 0.97153
## Proportion of Variance 0.03285 0.02882 0.02288 0.02097 0.02084 0.02021 0.01966
## Cumulative Proportion 0.61600 0.64482 0.66770 0.68867 0.70951 0.72972 0.74938
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.93902 0.91884 0.90249 0.89385 0.88124 0.8542 0.82717
## Proportion of Variance 0.01837 0.01759 0.01697 0.01665 0.01618 0.0152 0.01425
## Cumulative Proportion 0.76775 0.78534 0.80231 0.81895 0.83513 0.8503 0.86459
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.7959 0.78104 0.74472 0.6999 0.67444 0.64110 0.62981
## Proportion of Variance 0.0132 0.01271 0.01155 0.0102 0.00948 0.00856 0.00826
## Cumulative Proportion 0.8778 0.89049 0.90205 0.9123 0.92173 0.93029 0.93855
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.60046 0.55939 0.53118 0.50361 0.48078 0.47309 0.45717
## Proportion of Variance 0.00751 0.00652 0.00588 0.00528 0.00482 0.00466 0.00435
## Cumulative Proportion 0.94607 0.95259 0.95846 0.96375 0.96856 0.97323 0.97758
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.43751 0.41427 0.39620 0.34701 0.33725 0.32269 0.25189
## Proportion of Variance 0.00399 0.00358 0.00327 0.00251 0.00237 0.00217 0.00132
## Cumulative Proportion 0.98157 0.98514 0.98841 0.99092 0.99329 0.99546 0.99678
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.24544 0.22522 0.16081 0.13258 0.003562 1.474e-10
## Proportion of Variance 0.00126 0.00106 0.00054 0.00037 0.000000 0.000e+00
## Cumulative Proportion 0.99804 0.99909 0.99963 1.00000 1.000000 1.000e+00
fviz_screeplot(pca_scaled, addlabels = TRUE)
fviz_contrib(pca_scaled, choice = "var", axes = 1)
We see that the scaled PCA is much more descriptive than the unscaled one. In the unscaled PCA, adjqty (adjusted total number of calls) is the only important variable and there is effectively a single principal component, whereas in the scaled version the variance is spread over more components and more variables. It is true that scaling discards the variables' original scales and the first components explain less of the total variability, but the result is much more descriptive.
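A handy by-product of the scaled PCA is counting how many components reach a target share of the variance; from the summary above, 17 components are needed for 80%. On toy data (the 80% target is an assumption), the same computation looks like this:

```r
# How many scaled components are needed to reach 80% of the variance
set.seed(7)
toy = as.data.frame(matrix(rnorm(600), ncol = 6))
p = prcomp(toy, scale. = TRUE)
cumvar = cumsum(p$sdev^2) / sum(p$sdev^2)
k = which(cumvar >= 0.8)[1]
k  # number of components needed for 80% on the toy data
```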
Now, let’s look at the other principal components.
fviz_contrib(pca_scaled, choice = "var", axes = 2)
Here fewer variables contribute to the component, but we can still get a rough idea of which variables are important.
fviz_contrib(pca_scaled, choice = "var", axes = 3)
With the data scaled, the variance explained by the principal components is spread much more evenly. The first 3 components explain only about 43% of the variance, but many more variables matter, and we will use these variables to cluster the data into groups. Otherwise we would be left with a single variable, which would not be accurate.
rm(pca)
Let's see whether there are customers that stand out from the others.
head(get_pca_ind(pca_scaled)$contrib[,1], n = 20)
## 1000001 1000003 1000004 1000005 1000006 1000007
## 3.025519e-04 8.996525e-04 9.920473e-04 1.988757e-04 1.736548e-02 5.163934e-04
## 1000009 1000010 1000011 1000012 1000013 1000014
## 2.607328e-03 7.915901e-04 5.365838e-06 3.980850e-04 3.083818e-04 8.475781e-04
## 1000015 1000016 1000017 1000018 1000019 1000020
## 7.630032e-03 6.659053e-05 1.949787e-05 8.853703e-05 7.616919e-04 9.712061e-05
## 1000021 1000022
## 1.071428e-02 1.152903e-02
And there are none.
Now let's look at specific customers to see the percentage of variance each one contributes.
head(sort(decreasing = TRUE, pca_scaled$x[,1]^2)/(pca_scaled$sdev[1]^2))/dim(customers_numeric.df)[1]
## 1011660 1049099 1001811 1000652 1071600 1000318
## 0.003392062 0.002252876 0.002085621 0.001727743 0.001629091 0.001613368
fviz_contrib(pca_scaled, choice = "ind", axes = 1)
If we do not pay attention to the customer IDs, we can see that some customers explain much more of the variance than the rest. However, since we treat all customers alike, we cannot know exactly what this means; perhaps comparing these customers with the churn variable would tell us whether they are the ones quitting.
What about the top customers?
fviz_contrib(pca_scaled, choice = "ind", axes = 1, top=200)
Here we can see that the customer contributions decay roughly exponentially, though without a steep slope. That is, some clients stand out, but not many: there are differences between the customers, but not huge ones.
customers_numeric_no_corr.df[order(get_pca_ind(pca_scaled)$contrib[,1], decreasing = TRUE)[1:20], ]
## totmrc_Mean da_Mean vceovr_Mean datovr_Mean roam_Mean change_mou
## 1011660 209.9900 159.3900 155.1750 0.8250 1.9000 -1070.25
## 1049099 199.9900 8.9100 494.0000 0.3900 0.4875 -2254.25
## 1001811 42.9425 2.9700 744.0625 0.0000 0.0000 -2785.00
## 1000652 75.0000 0.9900 432.0750 0.0000 0.0000 135.75
## 1071600 142.4900 5.6925 285.2750 0.9250 36.5675 -403.75
## 1000318 202.0050 55.1925 215.1375 0.0975 0.0975 105.75
## 1085979 209.9900 4.4550 0.0000 0.0000 20.0925 -416.50
## 1063639 341.9250 72.7650 68.2500 0.0000 0.0000 529.25
## 1000572 80.0000 26.7300 188.7250 0.0000 0.0000 -932.25
## 1070494 138.9900 9.4050 183.7750 1.2375 3.8650 1054.50
## 1098072 148.4500 5.1975 285.8000 0.7875 1.3650 345.00
## 1038142 84.9800 57.6675 104.1875 4.5825 9.4975 189.00
## 1039871 75.0000 4.4550 268.8125 1.7550 0.6475 948.75
## 1001077 210.3925 7.6725 83.4375 0.7500 12.1350 -355.00
## 1013042 136.2000 30.9375 890.7625 0.0000 37.0850 -2530.75
## 1028321 56.2500 2.9700 50.1250 0.2925 3.0200 -1291.50
## 1069899 210.3925 17.5725 54.8875 0.0000 16.5400 -322.50
## 1079185 82.4875 0.4950 424.9000 0.3900 0.2925 1295.75
## 1053185 84.9900 0.2475 273.8125 0.0000 11.2800 199.50
## 1057745 195.3925 0.0000 127.3875 0.0000 7.4575 940.25
## change_rev drop_vce_Mean drop_dat_Mean blck_vce_Mean blck_dat_Mean
## 1011660 -154.8250 117.333333 0.000000 9.000000 0.0000000
## 1049099 -478.4850 60.000000 0.000000 58.666667 0.0000000
## 1001811 -524.0600 27.333333 0.000000 40.333333 0.0000000
## 1000652 97.2600 78.666667 0.000000 9.666667 0.0000000
## 1071600 -129.8225 106.666667 0.000000 13.000000 0.0000000
## 1000318 -63.5975 7.666667 0.000000 64.000000 0.0000000
## 1085979 -23.6875 19.666667 0.000000 199.000000 0.3333333
## 1063639 -53.0750 35.000000 0.000000 144.000000 0.0000000
## 1000572 -74.9875 9.333333 0.000000 35.000000 0.0000000
## 1070494 12.3675 88.000000 0.000000 1.666667 0.0000000
## 1098072 -228.8100 73.333333 0.000000 4.666667 0.0000000
## 1038142 2.2325 69.000000 1.333333 13.666667 0.0000000
## 1039871 146.9200 94.000000 0.000000 8.333333 0.0000000
## 1001077 -25.7825 70.333333 1.666667 10.666667 2.0000000
## 1013042 -768.5750 54.666667 0.000000 239.000000 0.0000000
## 1028321 -120.5400 57.666667 0.000000 10.333333 0.0000000
## 1069899 -85.1425 130.666667 0.000000 24.000000 0.0000000
## 1079185 389.8225 130.666667 0.000000 18.000000 0.0000000
## 1053185 70.1500 72.000000 0.000000 9.666667 0.0000000
## 1057745 39.0525 75.000000 0.000000 44.000000 0.0000000
## unan_vce_Mean unan_dat_Mean recv_sms_Mean custcare_Mean cc_mou_Mean
## 1011660 326.0000 0.0000000 0 7.6666667 43.096667
## 1049099 318.6667 0.0000000 0 2.3333333 4.523333
## 1001811 317.0000 0.0000000 0 0.3333333 0.540000
## 1000652 321.0000 0.0000000 0 0.0000000 0.000000
## 1071600 316.3333 0.0000000 0 20.6666667 18.576667
## 1000318 157.0000 0.0000000 0 5.3333333 21.690000
## 1085979 373.6667 0.3333333 0 9.0000000 39.380000
## 1063639 126.6667 0.0000000 0 28.6666667 25.840000
## 1000572 230.3333 0.0000000 0 0.0000000 0.000000
## 1070494 222.3333 0.0000000 0 20.3333333 67.120000
## 1098072 292.0000 0.0000000 0 0.0000000 0.000000
## 1038142 679.6667 0.0000000 0 20.6666667 52.000000
## 1039871 308.6667 0.0000000 0 31.3333333 42.416667
## 1001077 227.6667 0.0000000 0 2.0000000 5.433333
## 1013042 164.0000 0.0000000 0 19.0000000 168.886667
## 1028321 170.6667 0.0000000 0 2.0000000 2.290000
## 1069899 419.0000 0.0000000 0 3.0000000 2.423333
## 1079185 814.3333 0.0000000 0 93.0000000 126.473333
## 1053185 447.3333 0.0000000 0 17.6666667 26.350000
## 1057745 232.6667 0.3333333 0 35.3333333 149.860000
## inonemin_Mean threeway_Mean mou_cvce_Mean mou_rvce_Mean owylis_vce_Mean
## 1011660 693.3333 4.3333333 2753.6333 1957.5600 179.00000
## 1049099 755.3333 0.3333333 1079.8633 1926.7400 379.00000
## 1001811 833.0000 0.0000000 1100.7733 2140.3967 437.00000
## 1000652 283.0000 1.6666667 3701.3633 807.7567 285.00000
## 1071600 489.0000 0.6666667 2361.7533 1327.5300 517.00000
## 1000318 216.3333 0.6666667 2248.9300 1666.5833 343.00000
## 1085979 1075.3333 5.3333333 1197.8633 1492.1900 167.66667
## 1063639 166.0000 10.3333333 2335.1700 1443.1933 393.66667
## 1000572 533.3333 0.3333333 1725.8633 1382.2467 197.00000
## 1070494 779.3333 12.0000000 1433.9333 1956.2267 195.66667
## 1098072 1145.3333 4.3333333 741.6933 1139.4733 315.66667
## 1038142 1093.6667 26.0000000 1458.8933 1502.3000 287.00000
## 1039871 1323.0000 4.0000000 1127.6367 1518.8200 98.33333
## 1001077 741.0000 1.0000000 1987.4967 1820.2667 389.66667
## 1013042 184.0000 6.6666667 2385.5433 1015.5267 118.66667
## 1028321 1852.3333 0.3333333 917.8867 1193.6467 540.66667
## 1069899 452.0000 16.6666667 1676.6933 927.7367 292.66667
## 1079185 297.0000 5.3333333 1763.1867 575.0700 644.33333
## 1053185 448.3333 12.3333333 1910.7700 541.0900 610.33333
## 1057745 336.3333 6.6666667 2846.6700 1617.8833 211.33333
## mouowylisv_Mean iwylis_vce_Mean mouiwylisv_Mean peak_dat_Mean
## 1011660 346.0033 41.00000 104.98000 4.6666667
## 1049099 277.5133 376.66667 476.07667 0.0000000
## 1001811 306.2033 519.33333 743.19000 0.0000000
## 1000652 328.1833 36.00000 65.67000 0.0000000
## 1071600 725.2700 282.00000 659.60667 3.3333333
## 1000318 549.3900 141.66667 434.37000 0.3333333
## 1085979 144.6100 174.66667 151.76667 0.3333333
## 1063639 554.0000 48.33333 99.85000 0.0000000
## 1000572 149.7767 281.66667 387.97667 0.0000000
## 1070494 326.9867 176.00000 250.73333 4.0000000
## 1098072 253.9333 280.00000 292.90667 0.3333333
## 1038142 251.8200 108.66667 125.70667 1.3333333
## 1039871 105.4133 72.00000 56.35000 1.3333333
## 1001077 585.3800 173.00000 342.61000 13.6666667
## 1013042 158.2133 16.66667 38.03667 0.0000000
## 1028321 384.9600 344.66667 318.16667 0.3333333
## 1069899 319.2967 82.00000 133.66667 0.0000000
## 1079185 1187.5633 131.33333 332.71667 0.0000000
## 1053185 674.0000 140.33333 152.48000 0.0000000
## 1057745 351.4033 77.66667 212.97667 0.3333333
## mou_peav_Mean mou_pead_Mean opk_vce_Mean opk_dat_Mean mou_opkv_Mean
## 1011660 2943.1367 9.23000000 803.6667 1.3333333 1768.0600
## 1049099 1585.2967 0.00000000 1121.0000 0.0000000 1421.3033
## 1001811 1395.5967 0.00000000 1429.3333 0.0000000 1845.5767
## 1000652 4015.3467 0.00000000 243.0000 0.0000000 493.7767
## 1071600 1503.5500 6.23666667 1167.3333 5.0000000 2185.7300
## 1000318 3089.5633 0.04333333 417.6667 0.0000000 825.9367
## 1085979 1210.5467 0.57666667 1643.3333 0.6666667 1479.5100
## 1063639 3179.2900 0.00000000 296.0000 0.0000000 599.0800
## 1000572 2761.3167 0.00000000 235.0000 0.0000000 346.7967
## 1070494 1303.8333 10.10000000 1318.0000 1.3333333 2046.3400
## 1098072 1037.9133 1.34333333 980.3333 1.0000000 843.2533
## 1038142 1241.7633 7.18666667 1438.0000 3.6666667 1719.4200
## 1039871 1217.7200 2.16666667 1438.3333 1.6666667 1428.7333
## 1001077 2548.2467 11.55666667 538.6667 24.0000000 1259.5133
## 1013042 2132.0633 0.00000000 425.3333 0.0000000 1269.0000
## 1028321 1022.7233 0.02666667 1399.3333 0.3333333 1088.8133
## 1069899 1434.5233 0.00000000 882.6667 0.0000000 1169.9000
## 1079185 879.0967 0.00000000 710.0000 1.0000000 1459.1633
## 1053185 1346.4233 0.00000000 1112.3333 0.0000000 1105.4333
## 1057745 1777.2500 0.02333333 738.3333 0.0000000 2687.3133
## mou_opkd_Mean drop_blk_Mean complete_Mean callfwdv_Mean callwait_Mean
## 1011660 0.9733333 126.33333 1381.6667 0 182.0000000
## 1049099 0.0000000 118.66667 1119.0000 0 165.0000000
## 1001811 0.0000000 67.66667 1189.3333 0 212.6666667
## 1000652 0.0000000 88.33333 1894.3333 0 146.6666667
## 1071600 6.2766667 119.66667 1354.6667 0 135.6666667
## 1000318 0.0000000 71.66667 1186.6667 0 126.6666667
## 1085979 0.9633333 219.00000 1464.3333 0 199.6666667
## 1063639 0.0000000 179.00000 1255.3333 0 127.0000000
## 1000572 0.0000000 44.33333 1203.0000 0 170.6666667
## 1070494 1.0766667 89.66667 977.0000 0 168.6666667
## 1098072 2.5933333 78.00000 911.3333 0 146.0000000
## 1038142 8.1866667 84.00000 1261.3333 0 2.6666667
## 1039871 2.3100000 102.33333 1056.0000 0 195.0000000
## 1001077 30.9233333 84.66667 944.6667 0 4.0000000
## 1013042 0.0000000 293.66667 843.0000 0 24.0000000
## 1028321 0.5733333 68.00000 1313.6667 0 24.6666667
## 1069899 0.0000000 154.66667 1429.6667 0 85.3333333
## 1079185 0.9933333 148.66667 1003.6667 0 0.3333333
## 1053185 0.0000000 81.66667 1741.6667 0 61.6666667
## 1057745 0.0000000 119.00000 965.0000 0 45.3333333
## months uniqsubs actvsubs adjrev adjqty avgrev avg6mou avg6qty avg6rev
## 1011660 31 2 1 27071.30 68667 902.38 7217 2470 566
## 1049099 13 1 1 6958.35 32050 579.86 6597 3256 779
## 1001811 48 1 1 7172.03 36392 199.22 3457 1639 519
## 1000652 51 2 2 19432.46 83893 396.58 6504 2593 558
## 1071600 8 1 1 2113.88 10996 301.98 3990 1712 333
## 1000318 52 1 1 12625.72 77419 247.56 5589 2176 485
## 1085979 8 1 1 1727.85 19017 246.84 4672 2887 260
## 1063639 12 2 2 5177.66 18781 470.70 5297 1830 429
## 1000572 51 5 5 10061.21 98705 201.22 4674 2673 296
## 1070494 14 1 1 2593.03 19549 199.46 4201 1993 300
## 1098072 48 4 2 8508.59 57416 283.62 2406 1425 333
## 1038142 23 2 1 4811.24 37152 300.70 3369 1840 208
## 1039871 20 1 1 3041.45 41909 160.08 4017 2759 237
## 1001077 48 2 1 11685.87 61539 248.64 5347 1972 413
## 1013042 28 1 1 7436.81 13377 309.87 3428 949 726
## 1028321 23 2 2 2253.18 44228 102.42 3322 2563 129
## 1069899 11 1 1 3273.40 17803 363.71 4272 2071 306
## 1079185 7 3 2 1349.83 4847 224.97 2254 808 225
## 1053185 14 1 1 2792.92 22335 214.84 3525 2140 291
## 1057745 13 1 1 3330.00 12630 277.50 5321 1381 310
## hnd_price phones models eqpdays
## 1011660 99.98999 19 9 58
## 1049099 129.98999 4 3 62
## 1001811 149.98999 4 4 406
## 1000652 199.98999 8 6 162
## 1071600 129.98999 3 2 35
## 1000318 149.98999 8 4 187
## 1085979 149.98999 2 2 30
## 1063639 29.98999 2 2 122
## 1000572 29.98999 7 5 225
## 1070494 199.98999 7 4 56
## 1098072 199.98999 16 8 -1
## 1038142 199.98999 4 4 119
## 1039871 149.98999 8 4 95
## 1001077 199.98999 8 5 126
## 1013042 149.98999 6 3 84
## 1028321 149.98999 5 4 48
## 1069899 99.98999 7 4 17
## 1079185 129.98999 2 2 49
## 1053185 129.98999 3 2 146
## 1057745 199.98999 20 8 44
Looking at the characteristics of each top customer, we do not see any clear relationship except for the two dominant variables, adjqty (adjusted total number of calls) and adjrev (adjusted total revenue): these customers are the ones with the highest values of both.
fviz_pca_var(pca_scaled, col.var = "contrib")
Here we can clearly see that there are strong correlations between some variables, and that we could plausibly sort the clients by these columns.
fviz_pca_biplot(pca_scaled, repel = TRUE)
## Warning: ggrepel: 87554 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Here we confirm that there are differences. The variables pointing to the right correspond to the most important variables of the first PC, and the ones pointing upwards to those of the second.
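As a self-contained illustration of how to read the top contributors and the explained variance out of a `prcomp` object (a hedged sketch on simulated data, not the customer data; the column names `v1`..`v5` are made up):

```r
# Hedged sketch on simulated data: two strongly correlated columns dominate PC1.
set.seed(11)
M = matrix(rnorm(500), ncol = 5)
colnames(M) = paste0("v", 1:5)
M[, "v2"] = M[, "v1"] + rnorm(100, sd = 0.1)    # make v1 and v2 strongly correlated
p = prcomp(M, scale. = TRUE)
sort(abs(p$rotation[, 1]), decreasing = TRUE)   # top contributors to PC1
summary(p)$importance["Proportion of Variance", 1:2]
```

The same pattern (`sort` on a column of `$rotation`) is what we use below to pick the most important variables of each PC.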
Let’s see the customers by the first and the second component.
fviz_pca_ind(pca_scaled,
label = "none", # hide individual labels
habillage = customers_original.df$churn, # color by groups
palette = c("#00AFBB", "#E7B800"))
With these two components alone it is not clear how to differentiate the churn groups. However, the overlap may simply be because we are projecting onto 2 dimensions while the points live in dimension 100. Now we will use factor analysis and compare the results. Let's see if we can divide the customers and whether the results are similar to the PCA.
# gives error because the data is too large
# x.f = factanal(customers_numeric_no_corr.df, factors = 5, rotation="none", scores="regression")
colsImportant = c()
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:21]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 2], decreasing = TRUE)[1:8]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:5]))
colsImportant = unique(colsImportant)
Since there are too many variables for the factor analysis to handle, we reduce the set. We keep the variables that contribute most to the first two principal components, namely the ones with loadings high enough according to the graphs above.
colsNotImportant = setdiff(names(customers_numeric_no_corr.df), colsImportant)
# keep only the important columns
customers_numeric_no_corr_small.df = customers_numeric_no_corr.df[, !(names(customers_numeric_no_corr.df) %in% colsNotImportant)]
After removing the less important variables we perform the Factor Analysis.
x.f = factanal(scale(customers_numeric_no_corr_small.df), factors = 5, rotation="none", scores="regression")
cbind(x.f$loadings, x.f$uniquenesses)
## Factor1 Factor2 Factor3 Factor4
## totmrc_Mean 0.53855899 0.052369938 0.0675643340 0.2445909659
## datovr_Mean 0.05284515 0.492075584 0.0803169986 -0.0108893091
## drop_vce_Mean 0.59235984 -0.032664000 0.3095817836 0.0355732333
## drop_dat_Mean 0.03072469 0.483688287 0.0883024561 -0.0024309374
## blck_dat_Mean 0.02606667 0.209065067 0.0440793503 -0.0005290235
## unan_vce_Mean 0.65630899 -0.055839058 0.4203812132 -0.0008397362
## unan_dat_Mean 0.04088886 0.345591459 0.0551863893 -0.0035888717
## inonemin_Mean 0.58090242 -0.084749786 0.4662057519 0.1332624854
## mou_cvce_Mean 0.93069733 -0.002614453 -0.0141734684 -0.1527130943
## mou_rvce_Mean 0.89421686 0.005914394 -0.0630970632 0.1580234480
## owylis_vce_Mean 0.71277751 -0.032197498 0.3891763949 0.0708396563
## mouowylisv_Mean 0.68895981 -0.001346128 0.1514976843 -0.0064457545
## iwylis_vce_Mean 0.56603953 -0.053063088 0.3048957010 0.0802807222
## mouiwylisv_Mean 0.58762256 0.003304955 0.0002544016 0.0089540951
## peak_dat_Mean 0.07385645 0.833785092 0.1359041825 0.0127266605
## mou_peav_Mean 0.87389769 -0.002093481 -0.0265600962 0.4480372365
## mou_pead_Mean 0.06888616 0.740078717 0.0933773012 0.0104945287
## opk_vce_Mean 0.77213653 -0.071461007 0.5127754955 -0.1556569717
## opk_dat_Mean 0.08801645 0.813932604 0.1542809227 -0.0256235487
## mou_opkv_Mean 0.88465594 0.003012529 -0.0358644036 -0.4553836517
## mou_opkd_Mean 0.06291549 0.595742734 0.0862516667 -0.0234392301
## drop_blk_Mean 0.57111855 0.013738250 0.3031792554 0.0101308494
## complete_Mean 0.85049239 0.005027087 0.4625418001 0.1168874594
## callwait_Mean 0.61248782 -0.063927624 0.2754300741 0.1755496122
## adjqty 0.50878967 -0.032290371 0.2982310055 0.2845531180
## avgrev 0.62604282 0.061862832 0.1179593977 0.2704399729
## avg6mou 0.89154531 0.025405445 0.1442287252 0.0313184693
## avg6qty 0.75471724 -0.025153759 0.4623494786 0.1879025962
## avg6rev 0.65380407 0.069354325 0.0955385001 0.2925922964
## Factor5
## totmrc_Mean -0.090440234 0.63465199
## datovr_Mean -0.001762682 0.74860201
## drop_vce_Mean 0.013234005 0.55078548
## drop_dat_Mean -0.007924692 0.75722698
## blck_dat_Mean 0.002085932 0.95348045
## unan_vce_Mean -0.013848103 0.38922335
## unan_dat_Mean -0.010431206 0.87581152
## inonemin_Mean 0.323499082 0.31561092
## mou_cvce_Mean -0.327158820 0.00500000
## mou_rvce_Mean 0.408690117 0.00500000
## owylis_vce_Mean 0.084330788 0.32726973
## mouowylisv_Mean 0.017974806 0.50199427
## iwylis_vce_Mean 0.305330467 0.48414804
## mouiwylisv_Mean 0.254505903 0.58988678
## peak_dat_Mean -0.019793636 0.28034492
## mou_peav_Mean -0.175740166 0.00500000
## mou_pead_Mean -0.033018032 0.43761949
## opk_vce_Mean 0.259058323 0.04441989
## opk_dat_Mean -0.001130737 0.30531189
## mou_opkv_Mean 0.071715371 0.00500000
## mou_opkd_Mean -0.015929955 0.63287576
## drop_blk_Mean -0.014028861 0.58146595
## complete_Mean -0.048118017 0.04671952
## callwait_Mean 0.217384431 0.46681364
## adjqty 0.084684189 0.56305404
## avgrev -0.062035760 0.51331861
## avg6mou 0.004470705 0.18269654
## avg6qty 0.143947208 0.15996598
## avg6rev -0.059521766 0.46946902
This is informative, but let's plot the loadings of each factor and see what we can deduce.
par(mfrow=c(5,1))
barplot(x.f$loadings[,1], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,2], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,3], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,4], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,5], las=2, col="darkblue", ylim = c(-1, 1))
At first sight we can discard factors 4 and 5, as their loadings are not very meaningful. Factor 3 also does not look especially important.
However, factors 1 and 2 are really interesting: they load on disjoint sets of variables, suggesting two opposite customer profiles. One profile could correspond to customers with churn = 1 (those who left the company) and the other to those who stayed. Looking at the data and the variables, the first factor may describe the customers that remain in the company, and the second factor those that leave.
This would mean that a high score on the first factor pushes the churn variable towards 0, and the opposite for the second factor.
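One way to check this hypothesis would be to compare the mean factor scores between the two churn groups (e.g. with `x.f$scores` and `customers_original.df$churn`). Here is a hedged sketch of that idea on simulated data; the names `f1`, `Xsim` and `churn` are hypothetical stand-ins, not columns of the real data set:

```r
# Hedged sketch: simulate data driven by one latent factor, let that factor
# drive a binary "churn" label, then compare mean factor scores per group.
set.seed(42)
n  = 300
f1 = rnorm(n)                                       # hypothetical latent factor
Xsim = sapply(1:6, function(j) f1 + rnorm(n, sd = 0.4))
fa = factanal(Xsim, factors = 1, scores = "regression")
churn = as.integer(f1 < 0)                          # label driven by the factor
tapply(fa$scores[, 1], churn, mean)                 # group means differ clearly
```

If a factor is really related to churn, the two group means of its scores should be far apart, as they are in this simulation.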
par(mfrow=c(2,1))
barplot(x.f$loadings[,1], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,2], las=2, col="darkblue", ylim = c(-1, 1))
rm(list = c("colsImportant", "colsNotImportant", "i", "index")) # clean the environment.
Let's now see whether, following our previous hypothesis, we can group the customers into the ones that left and the ones that did not.
First of all, let's see what the optimal number of clusters is. Our data is too large to test directly, so we split it into random folds and check whether the folds agree on the optimal value.
nFolds = 6
folds = sample(rep(1:nFolds, length.out = nrow(customers_numeric_no_corr_small.df)))
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 1), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 3), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 6), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))
We only print a couple of the graphs because otherwise the machine cannot handle it. The portions of the data set look fairly similar, and the optimal number of clusters seems to be around 3, 4 or 5: that is where the within-cluster sum of squares stops dropping sharply (the "elbow").
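The "elbow" that `fviz_nbclust(..., method = 'wss')` looks for can be reproduced in base R. A hedged sketch on synthetic data with three well-separated groups (not a fold of the customer data):

```r
# Hedged sketch: total within-cluster sum of squares for k = 1..6 on data
# with three well-separated groups; the curve bends ("elbows") at k = 3.
set.seed(7)
Xs = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
wss = sapply(1:6, function(k) kmeans(Xs, centers = k, nstart = 20)$tot.withinss)
# plot(1:6, wss, type = "b")   # the elbow appears at k = 3
```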
Now let's find the optimal number of clusters using the silhouette criterion, that is, the average silhouette width.
for (i in 1:nFolds) {
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == i), ]
print(fviz_nbclust(X, kmeans, method = 'silhouette'))
}
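The silhouette criterion can also be computed without factoextra. A self-contained illustration on synthetic data, using the `cluster` package (which ships with R as a recommended package):

```r
library(cluster)  # for silhouette()

# Hedged sketch: average silhouette width for k = 2..6 on simulated data
# with three well-separated groups; the maximum sits at k = 3.
set.seed(9)
Y3 = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
d3 = dist(Y3)
avg_sil = sapply(2:6, function(k) {
  km = kmeans(Y3, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d3)[, "sil_width"])
})
(2:6)[which.max(avg_sil)]   # best k by the silhouette criterion
```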
Here the graphs indicate that 3 clusters is mostly the base clustering we can make. Finally, we will use the gap statistic (computed by bootstrapping) to decide. However, since bootstrapping is computationally very expensive, we will do it for just one fold.
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 1), ]
fviz_nbclust(X, kmeans, method = 'gap_stat', k.max = 8)
After a while, the graph shows that the best number of clusters is 3 to 4.
Combining all these conclusions, the best number of clusters is 3, so that is what we will use. We will also scale all the data, as explained earlier, so that no single variable dominates.
Let's start with a plain kmeans and see the clusters it produces.
X = as.data.frame(scale(customers_numeric_no_corr_small.df))
fit = kmeans(X, centers=3, nstart=100)
groups = fit$cluster
Now we plot the number of occurrences in each group.
barplot(table(groups), col="blue")
Here we see that one group contains most of the customers and another is not so big, while the third group is almost negligible. The hypothesis could be the same as before: one group may correspond to customers that left the company and another to the ones that remain. First we will explore this clustering further, and then see if other clustering methods support the same hypothesis.
centers = fit$centers
for (i in 1:3) {
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,100), main=paste("Cluster", i,": Group center in blue, global center in red"))
print(points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19))
}
## NULL
## NULL
## NULL
Here we can clearly see that the smallest group looks like a mixed cluster. The first cluster has much higher values than the second, which could mean that the customers in the first group are more extreme and therefore more likely to leave the company.
What about plotting the clusters?
fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+scale_fill_brewer(palette="Paired")
Here we can see some differences: the most populated cluster (cluster 2) is at the far left, followed by the first cluster and finally the last one. This suggests that cluster 3 contains the most extreme customers and cluster 2 the most common ones.
What about two clusters?
fit2 = kmeans(X, centers=2, nstart=100)
groups2 = fit2$cluster
barplot(table(groups2), col="blue")
centers2 = fit2$centers
for (i in 1:2) {
bar1=barplot(centers2[i,], las=2, col="darkblue", ylim=c(-2,100), main=paste("Cluster", i,": Group center in blue, global center in red"))
print(points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19))
}
## NULL
## NULL
With two clusters we see the same picture as before, but only the two most populated clusters. That is why we will keep 3 clusters, as it is more precise.
fviz_cluster(fit2, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+scale_fill_brewer(palette="Paired")
The picture is essentially the same, so we discard this fit.
rm(list = c("fit2", "groups2"))
#fit.kmeans <- eclust(X, "kmeans", stand=TRUE, k=3)
#fviz_silhouette(fit.kmeans)
Because the data set is too large, we cannot compute the silhouette distance. Instead we will use adjustedRandIndex to check whether other clusterings give similar results.
S_x = cov(X)
iS = solve(S_x)                           # inverse covariance matrix
e = eigen(iS)
V = e$vectors
B = V %*% diag(sqrt(e$values)) %*% t(V)   # iS^(1/2), the whitening matrix
Xtil = scale(X, scale = FALSE)            # center the data
X_maha = as.data.frame(Xtil %*% B)        # whitened data
In the whitened data, Euclidean distance corresponds to Mahalanobis distance in the original space, so running kmeans on X_maha is a kmeans with the Mahalanobis distance instead of the Euclidean one.
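We can sanity-check this transformation on simulated data: Euclidean distances between whitened points should match Mahalanobis distances between the original points. A base-R sketch, building the matrix `B` exactly as above:

```r
# Hedged sketch: Euclidean distance after whitening equals Mahalanobis distance.
set.seed(1)
Y  = matrix(rnorm(200), ncol = 2) %*% matrix(c(2, 1, 0, 1), 2)  # correlated data
S  = cov(Y)
ev = eigen(solve(S))
B  = ev$vectors %*% diag(sqrt(ev$values)) %*% t(ev$vectors)     # S^(-1/2)
Yw = scale(Y, scale = FALSE) %*% B                              # whitened data

d_eucl = sqrt(sum((Yw[1, ] - Yw[2, ])^2))
d_maha = sqrt(mahalanobis(Y[1, , drop = FALSE], center = Y[2, ], cov = S))
all.equal(as.numeric(d_eucl), as.numeric(d_maha))   # TRUE
```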
fit.mahalanobis = kmeans(X_maha, centers=3, nstart=100)
groups = fit.mahalanobis$cluster
centers=fit.mahalanobis$centers
colnames(centers)=colnames(X)
centers
## totmrc_Mean datovr_Mean drop_vce_Mean drop_dat_Mean blck_dat_Mean
## 1 0.08227674 0.05975652 0.1962781 0.038779652 -0.023503668
## 2 -0.01407061 -0.01199369 -0.0336102 -0.006952899 -0.006260222
## 3 0.02877424 7.00291711 0.2400984 1.276549078 40.441387282
## unan_vce_Mean unan_dat_Mean inonemin_Mean mou_cvce_Mean mou_rvce_Mean
## 1 0.2021039 0.0001137759 0.17460727 0.4746795 0.6589273
## 2 -0.0344434 -0.0073110620 -0.02992108 -0.0812676 -0.1125608
## 3 -0.3996330 28.6917362272 0.29912291 0.5200136 -0.2656295
## owylis_vce_Mean mouowylisv_Mean iwylis_vce_Mean mouiwylisv_Mean peak_dat_Mean
## 1 0.4245518 0.25294551 0.17475016 0.28971835 0.018624314
## 2 -0.0724663 -0.04316964 -0.02989372 -0.04951674 -0.001858567
## 3 -0.3973632 -0.25785128 0.09536336 -0.01530147 -5.213056272
## mou_peav_Mean mou_pead_Mean opk_vce_Mean opk_dat_Mean mou_opkv_Mean
## 1 -0.039239086 -0.011660395 0.90195668 0.030409262 1.1467535
## 2 0.006624845 -0.003516711 -0.15414852 -0.004628858 -0.1960438
## 3 0.323314969 21.680425029 -0.07888063 -2.238560808 0.1294568
## mou_opkd_Mean drop_blk_Mean complete_Mean callwait_Mean adjqty
## 1 0.026568443 0.29415167 0.29642035 0.06478493 0.08632330
## 2 -0.006221988 -0.05051968 -0.05068710 -0.01104278 -0.01476367
## 3 6.613490910 0.94941445 0.08232185 -0.12082083 0.03426782
## avgrev avg6mou avg6qty avg6rev
## 1 0.06840964 0.65719883 0.2310926 0.014523365
## 2 -0.01176314 -0.11231681 -0.0395699 -0.002545864
## 3 0.27582883 -0.06328229 0.2753835 0.249598249
After running the Mahalanobis-distance kmeans, let's look at the distribution of the clusters.
barplot(table(groups), col="blue")
It looks like the cluster sizes are the same! The only thing that changes is the numbering of the clusters, which does not really matter.
Let's see whether the cluster centers are similar to those of the Euclidean-distance kmeans.
for (i in 1:3) {
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
}
Here it shows that they are not really the same. It is a bit strange: the smallest cluster has many extreme variable values, whereas the most populated cluster is almost at 0. Let's do a bit more research and see what is going on.
fviz_cluster(fit.mahalanobis, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+scale_fill_brewer(palette="Paired")
Here we can conclude that the most populated clusters remain the same; however, cluster number one is no longer the extreme one, it instead collects the observations with outlying variable values.
adjustedRandIndex(fit$cluster, fit.mahalanobis$cluster)
## [1] 0.3844098
Here we see that the two clusterings are not that similar (ARI << 1), but this could be due to the least populated cluster. In the end, I think the Mahalanobis clustering is more accurate, as it divides the data in two and also isolates the outliers. We will try PAM now and see if this hypothesis is supported.
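For reference, the adjusted Rand index can be computed in a few lines of base R (a hedged sketch of the same quantity that `mclust::adjustedRandIndex` returns): it counts pairs of points on which the two labelings agree, corrected for chance, so renumbering the clusters does not change the score.

```r
# Hedged base-R sketch of the adjusted Rand index between two labelings.
adjusted_rand = function(a, b) {
  tab    = table(a, b)
  comb2  = function(n) n * (n - 1) / 2           # "n choose 2"
  sum_ij = sum(comb2(tab))                       # agreeing pairs per cell
  sum_a  = sum(comb2(rowSums(tab)))
  sum_b  = sum(comb2(colSums(tab)))
  expected  = sum_a * sum_b / comb2(length(a))   # expected under random labels
  max_index = (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

adjusted_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))   # identical up to relabelling -> 1
```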
Partitioning (clustering) of the data into k clusters around medoids.
# fit.pam = eclust(X, "pam", stand=TRUE, k=3, graph=F)
# ERROR: pam only allows a max of 65536 observations
PAM cannot handle this much data, which is why we removed half of the rows. Bear in mind that the PAM results will be less accurate because we are using a sample of roughly half the data. The rows were picked at random, so the sample should be fairly representative, but we have to keep it in mind. We also remove some variables, because otherwise it crashes.
colsImportant = c()
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:10]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 2], decreasing = TRUE)[1:4]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:3]))
colsImportant = unique(colsImportant)
As before, there are too many variables, so we again keep only the variables that contribute most to the first two PCs, the ones with loadings high enough according to the graphs above.
colsNotImportant = setdiff(names(X), colsImportant)
X = scale(customers_numeric_no_corr_small.df)
# scale() returns a matrix, so use colnames() rather than names() to drop columns
X = X[, !(colnames(X) %in% colsNotImportant), drop = FALSE]
nFolds = 3
folds = sample(rep(1:nFolds, length.out = nrow(X)))
X = as.data.frame(X[which((folds == 1)), ])
#fit.pam = eclust(X, "pam", k=3, graph=F)
#fviz_cluster(fit.pam, data = X, geom = c("point"), pointsize=1)+ theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = F)+scale_fill_brewer(palette="Paired")
#adjustedRandIndex(fit$cluster, fit.pam$clustering)
Even when we reduce the data set it does not work. It looks like PAM cannot be performed on a data set this big, so we do not use it. We will keep using this much smaller data set for the remaining methods, because otherwise R crashes when we try to perform the other clusterings.
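A workaround worth noting: the `cluster` package also provides `clara`, a PAM variant designed for large data sets, which runs PAM on random subsamples and assigns the remaining points to the nearest medoid. A hedged sketch on simulated data (not the customer data):

```r
library(cluster)  # clara() ships with R in the recommended "cluster" package

# Hedged sketch: clara scales PAM to large n by clustering subsamples.
set.seed(3)
big = rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
            matrix(rnorm(2000, mean = 5), ncol = 2))
cl = clara(big, k = 2, samples = 20)
table(cl$clustering)   # two groups of roughly 1000 points each
```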
Let’s start with the Gaussian kernel kmeans.
#fit.ker = kkmeans(as.matrix(X), centers=3, kernel="rbfdot") # Radial Basis kernel (Gaussian)
#object.ker = list(data = X, cluster = fit.ker@.Data)
#fviz_cluster(object.ker, geom = c("point"), ellipse=F,pointsize=1)
#adjustedRandIndex(fit$cluster, fit.ker$cluster)
As with PAM, it looks like kernel kmeans cannot be performed on this data set. We could reduce the data set further, but the result would be of little use: the conclusions would be too far removed from the original data and most probably not correct.
What about hierarchical clustering? Is it similar to the plain kmeans that we performed at the beginning? (All of the hierarchical clustering code is commented out because it crashed R whenever we tried to render the HTML.)
#d = dist(scale(X), method = "euclidean")
#hc = hclust(d, method = "average")
#hc$labels = names
#plot(hc)
Here everything is a mess and we cannot differentiate anything. Let’s try using ggplot and see if we can draw some conclusions.
# fviz_dend(x = hc, k = 3,color_labels_by_k = TRUE,cex = 0.8,type = "phylogenic",repel = TRUE)+theme(axis.text.x=element_blank(),axis.text.y=element_blank())
After waiting more than 4 hours, nothing showed, so we decided not to draw the dendrogram. It is really computationally expensive for a data set with this many entries. Moreover, the dendrogram would not be that useful, since there should be only two types of customers, all at the same level.
#groups.hc = cutree(hc, k = 3)
#fit.small = kmeans(X, centers=3, nstart=100)
We also fit a kmeans on the same small data set for comparison.
#adjustedRandIndex(fit.small$cluster, groups.hc)
Even though they use almost the same data set, the two clusterings do not look at all similar (the index is about -0.552e-5). This could be partly because we are using a small data set, but mainly it is because a dendrogram-based clustering is not the best fit for our data.
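If one still wanted a dendrogram, standard practice is to run hierarchical clustering on a small random subsample, since `dist` needs O(n²) memory. A hedged sketch on simulated data (not the customer data):

```r
# Hedged sketch: average-linkage hierarchical clustering on a small subsample.
set.seed(5)
sub = rbind(matrix(rnorm(60, mean = 0), ncol = 2),
            matrix(rnorm(60, mean = 4), ncol = 2))
hc = hclust(dist(scale(sub)), method = "average")
groups.sub = cutree(hc, k = 2)
table(groups.sub)   # two groups of about 30 points each
```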
Finally, let's try Expectation-Maximization clustering. It is related to kmeans but assigns each customer a probability of belonging to each cluster; the goal is to maximize the likelihood of the data under a mixture model.
#res.Mclust <- Mclust(scale(X))
#summary(res.Mclust)
#head(res.Mclust$classification)
#fviz_mclust(object = res.Mclust, what = "BIC", pallete = "jco") + scale_x_discrete(limits = c(1:10))
EM clustering does not work with this data either, as it is too big; the same happens as with PAM. Therefore no cluster plot can be drawn either.
#fviz_mclust(object = res.Mclust, what = "classification", geom = "point", pallete = "jco")
#adjustedRandIndex(fit$clusters, res.Mclust$classification)
In the end, after trying all these clustering methods, we settled on plain kmeans with the Mahalanobis distance as the one that best sorts the population. Being a simple model, it can be run on really big data sets (such as this one), whereas the more complex methods cannot (as explained earlier). We could also have chosen the Euclidean-distance kmeans, but the Mahalanobis version sorts the data better, as explained previously.
In the end, we have clearly seen that everything we did points to two groups of customers.
PCA showed us that some variables are strongly related to the output. Moreover, the most important variables of one PC were characteristic of the customers who tend not to leave the company, and the opposite held for the other PC.
The Factor Analysis showed only two really important factors, as the PCA had suggested, and we could identify two latent variables strongly correlated with the churn variable.
Finally, for the clustering part, the only methods that worked were the simple ones. However, they sorted the data quite well, especially the Mahalanobis-distance kmeans, which split the data into two clear clusters plus another one containing the outliers of the second PC.
To sum up, we identified the variables most important for each group, and from those one could predict whether a customer will leave the company. But for that, it is better to use supervised learning tools.